KJ 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of the product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch.

(a)

Start R and use these commands to load the data:

> library(AppliedPredictiveModeling)
> data(ChemicalManufacturingProcess)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

# Load libraries and data
library(AppliedPredictiveModeling)
library(caret); library(mice); library(corrplot)  # modeling, imputation, correlation plots
library(dplyr); library(ggplot2); library(kableExtra); library(forecast)  # wrangling, plotting, tables, accuracy()
data(ChemicalManufacturingProcess)

(b)

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in those missing values (e.g., see Sect. 3.8).

We’ll first count the missing values per variable to see which predictors need imputation and how severe the missingness is.

# Show missing value counts
dfchem <- ChemicalManufacturingProcess  # To avoid typing this many letters
data.frame(Missing=colSums(is.na(dfchem))) %>%
    filter(Missing > 0) %>%
    kbl(caption='Missing value counts') %>%
    kable_classic(full_width=F)
Missing value counts
Missing
ManufacturingProcess01 1
ManufacturingProcess02 3
ManufacturingProcess03 15
ManufacturingProcess04 1
ManufacturingProcess05 1
ManufacturingProcess06 2
ManufacturingProcess07 1
ManufacturingProcess08 1
ManufacturingProcess10 9
ManufacturingProcess11 10
ManufacturingProcess12 1
ManufacturingProcess14 1
ManufacturingProcess22 1
ManufacturingProcess23 1
ManufacturingProcess24 1
ManufacturingProcess25 5
ManufacturingProcess26 5
ManufacturingProcess27 5
ManufacturingProcess28 5
ManufacturingProcess29 5
ManufacturingProcess30 5
ManufacturingProcess31 5
ManufacturingProcess33 5
ManufacturingProcess34 5
ManufacturingProcess35 5
ManufacturingProcess36 5
ManufacturingProcess40 1
ManufacturingProcess41 1
# Use MICE to impute missing values
imp <- mice(dfchem, printFlag=F)
dfchem2 <- complete(imp)

# Verify no more missing values exist
print(paste0('Missing values after imputation: ', sum(is.na(dfchem2))))
## [1] "Missing values after imputation: 0"
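As a sanity check on the imputation (a sketch, not part of the original workflow), caret's preProcess() offers a simpler deterministic alternative; 'medianImpute' fills each missing cell with its column median:

```r
# Sketch: median imputation via caret's preProcess, as a lighter-weight
# alternative to MICE (assumes caret is loaded and dfchem is defined above)
pp_med <- preProcess(dfchem, method='medianImpute')
dfchem_med <- predict(pp_med, dfchem)
sum(is.na(dfchem_med))  # 0 - every missing cell is filled
```

MICE remains preferable here because it models each incomplete variable conditionally on the others rather than collapsing each column to a single statistic.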

(c)

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First, we’ll choose a random seed for reproducible results. Then we’ll split the data using createDataPartition(), which attempts to split the data so that the training and test sets have approximately proportionate distributions of the outcome variable.

# Set seed
set.seed(777)

# Split into train/test; createDataPartition generates indices of the training set
train_indices <- createDataPartition(dfchem2$Yield, p=0.80, times=1, list=F)
dftrain <- dfchem2[train_indices,]
dftest <- dfchem2[-train_indices,]

# Examine train and test sets
hist(dftrain$Yield, main='Training set yield histogram')

hist(dftest$Yield, main='Test set yield histogram')

Now we’ll examine the predictors to look for correlations and examine their relationship to the outcome variable.

# Corr chart - not useful here because of the number of predictors
# chart.Correlation(dftrain)

# Corr plot
corr1 <- cor(dftrain)
corrplot(corr1, order='hclust', type='upper', diag=F)

# Show candidates for removal
print("Candidates for removal due to high correlation:")
## [1] "Candidates for removal due to high correlation:"
high_corr <- findCorrelation(corr1, cutoff=0.9, exact=T, verbose=F, names=F)
colnames(dftrain[,high_corr])
##  [1] "BiologicalMaterial02"   "BiologicalMaterial04"   "BiologicalMaterial12"  
##  [4] "ManufacturingProcess29" "ManufacturingProcess42" "ManufacturingProcess26"
##  [7] "ManufacturingProcess25" "ManufacturingProcess31" "ManufacturingProcess18"
## [10] "ManufacturingProcess40"
# Remove the highly correlated variables
dftrain2 <- dftrain[,-high_corr]
dftest2 <- dftest[,-high_corr]
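Pairwise correlation is not the only degeneracy worth screening for; a quick near-zero-variance check (a sketch using caret's nearZeroVar(), assuming the objects built above) flags predictors dominated by a single value:

```r
# Sketch: flag near-zero-variance predictors remaining after the correlation filter
nzv_cols <- nearZeroVar(dftrain2)
length(nzv_cols)  # count of degenerate predictors worth considering for removal
```

No columns are removed by this check; it simply documents whether the filtered training set still hides nearly constant predictors.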

Now we’ll try modeling. First, prepare a data frame for the results, separate the outcome and predictor variables, and set cross-validation parameters.

# Results data frame
dfr <- data.frame(matrix(nrow=7, ncol=3))  # one row per model fit below
colnames(dfr) <- c('Model', 'Tuning.Parameters', 'RMSE')

# Separate outcome and predictors
trainx <- dftrain2 %>% dplyr::select(-Yield)
trainy <- dftrain2$Yield
testx <- dftest2 %>% dplyr::select(-Yield)
testy <- dftest2$Yield

# specify 10x cross-validation
ctrl <- trainControl(method='cv', number=10)

Linear regression

# Linear model
set.seed(77)
fit <- train(x=trainx, y=trainy, method='lm', trControl=ctrl)
fit
## Linear Regression 
## 
## 144 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.286066  0.5084493  1.369764
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
dfr[1,] = data.frame(Model='LM', Tuning.Parameters='', RMSE=fit$results[['RMSE']])

# Stepwise linear model using MASS package/caret
set.seed(77)
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx) / 2, 0)))  # Max number of parameters to use
fit <- train(x=trainx, y=trainy, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid)
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
fit
## Linear Regression with Stepwise Selection 
## 
## 144 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     1.624664  0.2226359  1.316949
##    2     1.375821  0.4152041  1.162999
##    3     1.296868  0.4858057  1.071156
##    4     1.298437  0.5019131  1.059803
##    5     1.340354  0.4771684  1.115468
##    6     1.451502  0.4046808  1.179163
##    7     1.350447  0.4618300  1.112193
##    8     1.307533  0.5043358  1.066602
##    9     2.407935  0.5033313  1.346444
##   10     2.271019  0.4967220  1.334601
##   11     2.224866  0.4985500  1.333145
##   12     2.252773  0.4621222  1.354180
##   13     2.376942  0.4026359  1.431658
##   14     1.853431  0.5066380  1.218464
##   15     1.282595  0.4986697  1.043252
##   16     1.846756  0.5030727  1.219705
##   17     1.345225  0.4570793  1.101332
##   18     1.320995  0.4917882  1.097843
##   19     1.280712  0.5234311  1.064212
##   20     1.292781  0.5164534  1.061669
##   21     1.323369  0.4876111  1.079384
##   22     1.274669  0.5304642  1.061286
##   23     1.273597  0.5313222  1.077300
##   24     1.339381  0.5144554  1.068906
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 23.
dfr[2,] = data.frame(Model='LM', Tuning.Parameters=paste0('nvmax=', fit$bestTune[['nvmax']]), RMSE=min(fit$results[['RMSE']]))

Robust linear regression

# Robust linear model - fit directly on the raw predictors, rlm errors out with
# "'x' is singular: singular fits are not implemented in 'rlm'", so we
# transform the predictors first and let PCA produce a full-rank input
set.seed(77)
trans <- preProcess(trainx, method=c('center', 'scale', 'YeoJohnson', 'nzv'))
trainx_trans <- predict(trans, trainx)
fit <- train(x=trainx_trans, y=trainy, method='rlm', preProcess='pca', trControl=ctrl)

dfr[3,] = data.frame(Model='RLM', Tuning.Parameters='', RMSE=fit$results[['RMSE']])

Partial least squares

# PLS, tuneLength=5
set.seed(77)
fit <- train(x=trainx, y=trainy, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=5)
fit
## Partial Least Squares 
## 
## 144 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##   1      1.676215  0.2522705  1.383035
##   2      1.674497  0.2541740  1.386033
##   3      1.589468  0.2770528  1.313656
##   4      3.508917  0.2431983  1.852400
##   5      3.091332  0.2582261  1.701353
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
dfr[4,] = data.frame(Model='PLS', Tuning.Parameters='tuneLength=5', RMSE=fit$results[['RMSE']])
         
# PLS, tuneLength=20
set.seed(77)
fit <- train(x=trainx, y=trainy, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=20)
fit
## Partial Least Squares 
## 
## 144 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.676215  0.2522705  1.3830349
##    2     1.674497  0.2541740  1.3860331
##    3     1.589468  0.2770528  1.3136558
##    4     3.508917  0.2431983  1.8523999
##    5     3.091332  0.2582261  1.7013526
##    6     3.606812  0.3286406  1.8396639
##    7     3.267749  0.3626882  1.6871394
##    8     2.849073  0.3859410  1.5535857
##    9     2.707362  0.3868188  1.4979158
##   10     2.554798  0.4163303  1.4226499
##   11     2.412650  0.4448560  1.3498822
##   12     2.902913  0.4779650  1.4479312
##   13     1.703657  0.5111518  1.1475910
##   14     1.303947  0.5818938  0.9918495
##   15     1.308174  0.6185297  0.9794667
##   16     1.256095  0.6403446  0.9573178
##   17     1.527881  0.6179692  1.0400251
##   18     1.764593  0.6134669  1.1002319
##   19     1.786071  0.6039765  1.1147107
##   20     1.868231  0.5853140  1.1452380
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 16.
dfr[5,] = data.frame(Model='PLS', Tuning.Parameters='tuneLength=20', RMSE=fit$results[['RMSE']])

Ridge regression

# Ridge regression
set.seed(77)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fit <- train(x=trainx, y=trainy, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
fit
## Ridge Regression 
## 
## 144 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE      Rsquared   MAE     
##   0.000000000  2.286066  0.5084493  1.369764
##   0.007142857  1.828814  0.5243584  1.221142
##   0.014285714  1.672191  0.5359743  1.163832
##   0.021428571  1.582316  0.5444225  1.129514
##   0.028571429  1.521912  0.5509506  1.107115
##   0.035714286  1.477937  0.5562217  1.090175
##   0.042857143  1.444310  0.5606085  1.076822
##   0.050000000  1.417720  0.5643401  1.066649
##   0.057142857  1.396178  0.5675674  1.058259
##   0.064285714  1.378407  0.5703949  1.051740
##   0.071428571  1.363539  0.5728981  1.046324
##   0.078571429  1.350962  0.5751332  1.041552
##   0.085714286  1.340233  0.5771427  1.037317
##   0.092857143  1.331019  0.5789601  1.033655
##   0.100000000  1.323064  0.5806118  1.030542
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
dfr[6,] = data.frame(Model='Ridge', Tuning.Parameters=paste0('lambda=', fit$bestTune[['lambda']]), RMSE=fit$results[['RMSE']])

Lasso regression

# Lasso regression via the elastic net
set.seed(77)
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
fit <- train(x=trainx, y=trainy, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
fit
## Elasticnet 
## 
## 144 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 129, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      1.433201  0.5856073  1.1901593
##   0.00    0.10      1.209237  0.6032783  1.0185864
##   0.00    0.15      1.140548  0.6111652  0.9536177
##   0.00    0.20      1.124102  0.6187659  0.9322020
##   0.00    0.25      1.121581  0.6200730  0.9283358
##   0.00    0.30      1.118237  0.6243936  0.9188952
##   0.00    0.35      1.117536  0.6276211  0.9146605
##   0.00    0.40      1.120802  0.6289250  0.9137445
##   0.00    0.45      1.131374  0.6254270  0.9208721
##   0.00    0.50      1.143630  0.6203770  0.9322169
##   0.00    0.55      1.159585  0.6121583  0.9473379
##   0.00    0.60      1.240437  0.5865593  0.9936258
##   0.00    0.65      1.370591  0.5687979  1.0475752
##   0.00    0.70      1.496816  0.5555727  1.0982473
##   0.00    0.75      1.629331  0.5432079  1.1493911
##   0.00    0.80      1.783242  0.5313500  1.2046776
##   0.00    0.85      1.946187  0.5209446  1.2605442
##   0.00    0.90      2.044545  0.5151547  1.2951275
##   0.00    0.95      2.162856  0.5114413  1.3323160
##   0.00    1.00      2.286066  0.5084493  1.3697641
##   0.01    0.05      1.539145  0.5182418  1.2725262
##   0.01    0.10      1.330375  0.5904373  1.1077079
##   0.01    0.15      1.193655  0.6047618  1.0002440
##   0.01    0.20      1.150271  0.6122303  0.9614231
##   0.01    0.25      1.121654  0.6228151  0.9368904
##   0.01    0.30      1.116869  0.6235431  0.9303592
##   0.01    0.35      1.116861  0.6244385  0.9230302
##   0.01    0.40      1.118073  0.6259194  0.9172130
##   0.01    0.45      1.118784  0.6272024  0.9131786
##   0.01    0.50      1.119599  0.6283309  0.9133613
##   0.01    0.55      1.123014  0.6279402  0.9155330
##   0.01    0.60      1.129705  0.6250074  0.9217190
##   0.01    0.65      1.150129  0.6160037  0.9396739
##   0.01    0.70      1.209070  0.6000021  0.9714812
##   0.01    0.75      1.291145  0.5872609  1.0063842
##   0.01    0.80      1.381933  0.5769310  1.0417455
##   0.01    0.85      1.472187  0.5678927  1.0771760
##   0.01    0.90      1.565296  0.5562961  1.1170458
##   0.01    0.95      1.665349  0.5425771  1.1582725
##   0.01    1.00      1.753162  0.5294971  1.1940172
##   0.10    0.05      1.613736  0.4781463  1.3352216
##   0.10    0.10      1.478266  0.5518656  1.2214140
##   0.10    0.15      1.348772  0.5879584  1.1184758
##   0.10    0.20      1.245124  0.6021145  1.0419137
##   0.10    0.25      1.181267  0.6078709  0.9902300
##   0.10    0.30      1.158867  0.6100634  0.9686527
##   0.10    0.35      1.136495  0.6173264  0.9502379
##   0.10    0.40      1.124044  0.6224835  0.9407894
##   0.10    0.45      1.118174  0.6270984  0.9323668
##   0.10    0.50      1.117726  0.6285291  0.9298664
##   0.10    0.55      1.120394  0.6288807  0.9248968
##   0.10    0.60      1.127845  0.6274251  0.9235546
##   0.10    0.65      1.134053  0.6253379  0.9210281
##   0.10    0.70      1.139854  0.6233843  0.9215161
##   0.10    0.75      1.157962  0.6180590  0.9328568
##   0.10    0.80      1.181786  0.6121834  0.9471762
##   0.10    0.85      1.209329  0.6065917  0.9614510
##   0.10    0.90      1.238325  0.6011691  0.9792009
##   0.10    0.95      1.277395  0.5916046  1.0042866
##   0.10    1.00      1.323064  0.5806118  1.0305422
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.35 and lambda = 0.01.
dfr[7,] = data.frame(
    Model='Lasso (elastic net)', 
    Tuning.Parameters=paste0('lambda=', fit$bestTune[['lambda']], ', fraction=', fit$bestTune[['fraction']]), 
    RMSE=fit$results[['RMSE']]
)
# Model comparison
dfr %>%
    kbl(caption='Model comparison') %>%
    kable_classic(full_width=F)
Model comparison
Model Tuning.Parameters RMSE
LM 2.286066
LM nvmax=23 1.273597
RLM 40.127835
PLS tuneLength=5 1.676215
PLS tuneLength=20 1.676215
Ridge lambda=0.1 2.286066
Lasso (elastic net) lambda=0.01, fraction=0.35 1.433201

The stepwise linear model with nvmax = 23 was the best-performing model under cross-validation, with an RMSE of 1.274.

(d)

Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

First, calculate the accuracy based on the test set.

# Rerun the model - stepwise linear model using MASS package/caret
set.seed(77)
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx) / 2, 0)))  # Max number of parameters to use
fit <- train(x=trainx, y=trainy, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid)
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
## Reordering variables and trying again:
# Generate y predictions
predy <- predict(fit, newdata=testx)

# Get accuracy results
accuracy(predy, testy)
##                  ME     RMSE      MAE       MPE     MAPE
## Test set -0.8679355 5.979448 2.014883 -2.224604 4.945104

The RMSE for the test set is substantially higher than the resampled RMSE on the training set (5.979 vs. 1.274, respectively), suggesting the stepwise model generalizes less well than the cross-validation results indicated. The MAPE of 4.95 still means predictions are, on average, within about 5% of the observed yields.
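For a comparison in the same units reported during tuning, caret's postResample() computes RMSE, R-squared, and MAE on the hold-out predictions (a minimal cross-check, assuming predy and testy from above):

```r
# Cross-check the test-set metrics with caret's postResample
postResample(pred=predy, obs=testy)  # named vector: RMSE, Rsquared, MAE
```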

(e)

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

# Extract coefficients from model
dftop <- data.frame(Coefficient=coef(fit$finalModel, fit$bestTune[['nvmax']] + 1))
dftop %>%
    kbl(caption='Model coefficients') %>%
    kable_classic(full_width=F)
Model coefficients
Coefficient
(Intercept) 75.6781250
BiologicalMaterial01 -0.1273122
BiologicalMaterial05 0.1797094
BiologicalMaterial07 -1.7364640
BiologicalMaterial09 0.3004392
BiologicalMaterial11 -0.0097183
ManufacturingProcess02 -0.0245667
ManufacturingProcess03 -7.7851401
ManufacturingProcess04 0.0652808
ManufacturingProcess09 0.3708992
ManufacturingProcess12 0.0001340
ManufacturingProcess13 -0.4367373
ManufacturingProcess15 0.0089976
ManufacturingProcess16 -0.0071973
ManufacturingProcess19 0.0057596
ManufacturingProcess20 0.0004919
ManufacturingProcess30 0.0413937
ManufacturingProcess33 0.1585066
ManufacturingProcess34 7.9789257
ManufacturingProcess36 -374.0649304
ManufacturingProcess37 -0.4717292
ManufacturingProcess38 0.0270597
ManufacturingProcess39 0.1039607
ManufacturingProcess44 0.6414816
ManufacturingProcess21 -0.1454709

As shown, the manufacturing process predictors dominate the list: 19 of the 24 selected predictors are process variables, and the largest-magnitude coefficients (e.g., ManufacturingProcess36, ManufacturingProcess34, ManufacturingProcess03) all belong to process predictors. This works out favorably, as these are elements that can be controlled or adjusted to affect yield, whereas the biological predictors can’t be changed.
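Because the predictors sit on very different scales, raw coefficient magnitudes (e.g., ManufacturingProcess36’s -374) are not directly comparable. A rough standardized-effect ranking (a sketch, not caret's formal varImp(); assumes dftop and dftrain2 as built above) multiplies each coefficient by its predictor's standard deviation:

```r
# Sketch: rank selected predictors by |coefficient| * sd(predictor),
# a scale-free proxy for each term's influence on predicted yield
vars <- setdiff(rownames(dftop), '(Intercept)')
std_effect <- abs(dftop[vars, 'Coefficient'] * sapply(dftrain2[, vars], sd))
sort(std_effect, decreasing=TRUE)[1:5]  # five largest standardized effects
```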

(f)

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

# Iterate over the selected predictors and plot each against Yield
for (p in rownames(dftop)) {
    if (p != '(Intercept)') {  # skip the intercept
        plt <- dfchem2 %>%
            ggplot()
        if (p %in% c('ManufacturingProcess01', 'ManufacturingProcess09')) {  # continuous predictors
            plt <- plt +
                geom_point(aes(x=.data[[p]], y=Yield))
        } else {  # discrete-valued predictors
            plt <- plt +
                geom_bar(aes(x=.data[[p]], y=Yield), stat='identity')
        }
        plt <- plt +
            xlab(p)
        print(plt)
    }
}

  • Based on the scatter plot for ManufacturingProcess09, increasing this process variable consistently produces better yields; its positive model coefficient (0.37) implies roughly a 0.37-unit gain in yield per unit increase.
  • The bar plot for ManufacturingProcess33 shows that optimal yields occur when ManufacturingProcess33 is around 64 units.
  • Likewise, yields are optimal when ManufacturingProcess34 is at 2.5 units.
  • If ManufacturingProcess41 is held at 0, the best yields are observed.
  • ManufacturingProcess01 is interesting: three outliers appear to skew the results. Without knowing how the data were collected or what those values represent, it is difficult to say whether they should be removed, though the scatter plot suggests they should have been excluded before modeling. Our interpretation of the problem was to perform the steps as given in the instructions; in a real-life scenario, a more rigorous exploration of the data would precede modeling and would likely have removed these points. Based on its small coefficient (0.09), the model interpreted this process as having a slight positive effect on yield.